Sunday, December 14, 2008

Why scientific computations fail

(I was busy this fall with another project. Now it's time to return to scientific programming)


The previous post covered several levels of tests for assessing the correctness of scientific computation. Now let us look at reasons why the results might not match what is expected.



  1. Programming errors
  2. Mistakes in derivation and transcription of equations
  3. Violations of assumptions or conventions
  4. Errors and instabilities from floating point
  5. Errors and instabilities from the algorithm (often problem-dependent)

    • Algorithms that have been well-studied by numerical analysts
    • Those that have not




Programming errors


These are the usual sorts of bugs found in any program (out-of-bounds array indices, memory errors, and so on). This category of problems can be tested with unit tests and can make use of tools and practices from mainstream software engineering.


Mistakes in derivation and transcription of equations


This is similar to the implementation not matching the specification in traditional software engineering. With scientific computation, though, the "specification" could be more precise in principle (equations and algorithms), and computer algebra systems could aid with performing and verifying steps in the derivation.
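
As a toy illustration of that last point, here is the kind of check a computer algebra system makes cheap (a sketch using sympy; the example function and its hand-derived derivative are invented for the illustration):

import sympy as sp

x, a = sp.symbols('x a', positive=True)
f = sp.exp(-a * x**2)

# Derivative as computed by the CAS
computed = sp.diff(f, x)

# Derivative as derived by hand and transcribed into the code
transcribed = -2 * a * x * sp.exp(-a * x**2)

# If the transcription is correct, the difference simplifies to zero
assert sp.simplify(computed - transcribed) == 0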


Violations of assumptions or conventions


These usually creep in during derivation of the equations. A stronger connection between the derivation and the final code might help catch them.


Errors and instabilities from floating point


Accumulated round-off, loss of precision from subtracting nearly equal numbers, and other problems. For more subtleties than you ever wanted to know about, visit W. Kahan's site.
One of the most annoying parts of floating point is the lack of a good model for understanding (and analyzing) floating point errors. I will return to this issue in a future post.
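
As a small illustration of the cancellation problem, consider the standard textbook quadratic-formula example (nothing here comes from any particular scientific code):

import math

# Solve x^2 - 1e8*x + 1 = 0. The roots are near 1e8 and 1e-8, but the
# textbook formula computes the small root by subtracting two nearly
# equal numbers.
b, c = -1e8, 1.0
sqrt_disc = math.sqrt(b * b - 4 * c)

naive = (-b - sqrt_disc) / 2         # catastrophic cancellation
stable = (2 * c) / (-b + sqrt_disc)  # rearranged to avoid the subtraction

print(naive)   # about 7.45e-09: no correct digits
print(stable)  # about 1e-08: accurate to machine precision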


Errors and instabilities from the algorithm.


This category might be better divided into two subcategories: algorithms that have been well-studied by numerical analysts (like matrix operations, certain ODEs and PDEs, and optimization), and those that have not. Many algorithms have been well-studied and have nice packages available to use as black boxes. However, most scientific programs are a composition of well-known algorithms along with some problem-dependent features.


Oftentimes the behavior of the algorithm depends on the details of the physical system (e.g., is it a metal or an insulator?).


This category also includes science-based and numerical approximations.


Now, how do we go from test case failure to deciding which of these categories the problem falls into (and then, of course, to locating the precise problem)?


  • Unit testing will help with the first category, but not so much with the rest. Unit testing is helpful when you want to make sure things haven't changed; the hard part is getting it right the first time.
  • The different levels of tests in the previous post should give some insight into which category the problem falls in.


First I want to look at various models of floating point error and see if we can gain some understanding of the intrinsic uncertainty in a calculation. This seems to be a necessary prerequisite for answering the question of 'how close is close enough?', in order to decide whether a test has actually failed or not.

Sunday, July 13, 2008

Assessing the correctness of scientific computation

In the previous post I promised some thoughts on assessing the correctness of scientific computation.


Computational science programs are typically assessed with a series of test cases, ranging from order-of-magnitude estimates ("those numbers don't look right") to comparisons with experimental data. I've attempted to divide these test cases into categories, where each category shares similar expectations as to what 'close enough' means and has similar implications for the potential causes of discrepancies.




  1. Order of magnitude estimates from general principles, back of the envelope calculations, or experience with similar problems.
  2. Analytically solvable cases.
  3. Small cases that are accurately solvable
    For small systems, or simple cases, there are often multiple solution methods, and some of them are very accurate (or exact). The methods can be compared for these small systems. An important feature of this category is that any approximations can be controlled and their effects made arbitrarily small.
  4. Results from the same (or similar) algorithm
    Comparing the same system with the same algorithm should yield close results, but now there is additional uncertainty from the exact implementation and from sensitivity to input precision (particularly when comparing against results from a journal article).
  5. Results from a different algorithm
    This looks similar to the third category (comparing with exact or more precise algorithms), but this is the case where there are multiple methods (with different approximations) for handling larger systems. There are likely to be more approximations (and probably uncontrolled approximations) involved.

    (In electronic structure there are Hartree-Fock (HF), several post-HF methods, Density Functional Theory, and Quantum Monte Carlo, and possibly a few others.) Now one has to deal with all the above possibilities for two programs, rather than just one.
  6. Experimental results
    In some ways this is quite similar to the previous category, except that determining 'sufficiently close' requires some understanding of the experimental uncertainties and corrections in addition to possible program errors.



The testing process for each case involves running the test case and deciding whether the program result is 'close enough'. If it is, proceed to the next test case. If it is not, work to discover the cause of the discrepancy. In the next post I'll look at a list of possible causes.

Saturday, July 12, 2008

Engineering scientific software

Some pointers from Greg Wilson on engineering scientific software.


The first pointer is to the "First International Workshop on Software Engineering for Computational Science and Engineering".


A couple of Wilson's short summaries of papers from this workshop
(from recent research reading)

Diane Kelly and Rebecca Sanders: “Assessing the Quality of Scientific Software” (SE-CSE workshop, 2008). A short, useful summary of scientists’ quality assurance practices. As many people have noted, most commercial tools aren’t helpful when (a) cosmetic rearrangement of the code changes the output, and (b) the whole reason you’re writing a program is that you can’t figure out what the answer ought to be.


Judith Segal: “Models of Scientific Software Development”. Segal has spent the last few years doing field studies of scientists building software, and based on those has developed a model of their processes that is distinct in several ways from both classical and agile models.


What I found interesting:

Kelly and Sanders suggest that the split between verification and validation is not useful, and propose 'assessment' be used as an umbrella term. (I will return to this point in a later post.)



The second pointer is to the presentations at TACC Scientific Software Days.


One of the presentations there was by Greg Wilson himself, on "HPC Considered Harmful". There is one slide on testing floating point code (slide 15); in particular, one bullet point asserts that testing against a tolerance is "superstition, not science". (And he's raised this point elsewhere.)
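
For concreteness, the practice in question looks something like this (a minimal sketch; the computation and the reference value are invented for illustration):

def total_energy():
    # Hypothetical stand-in for the real computation under test
    return -14.66735625

REFERENCE = -14.667356  # from a previous run, a paper, or another code

def test_total_energy():
    value = total_energy()
    # The contested practice: pass/fail against an arbitrary tolerance.
    # How the tolerance should be chosen is exactly the open question.
    assert abs(value - REFERENCE) <= 1e-6 * abs(REFERENCE)

test_total_energy()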


This statement has bothered me since I first read it, so the next few posts will explore some of the context surrounding this question. The first post will list different methods for assessing the correctness of a program, the next post will categorize possible errors, and finally a post will look at 'close enough'.

Thursday, May 29, 2008

Computing an unbiased inverse

In a previous entry, Reweighting is biased, I had discussed the problem of a finite-sample size bias in reweighting. In this entry I will investigate a possible solution.


But first, I had asked if this was known in the literature. A comment by Lee Warren pointed to this article, Population size bias in descendant-weighted diffusion quantum Monte Carlo simulations, which discusses this very issue (in a slightly different context, but the root issue is the same). In that context the problem is even more severe since the straightforward solution (increase N) carries a heavier computational cost.


One of the crucial issues is finding a good estimator for the inverse (1/D). The naive estimator (simply do the division) is biased. So the question is: are there any other ways of performing division that could be used? Yes, in digital arithmetic there are a couple of algorithms for division using only multiplication and addition, Newton-Raphson and Goldschmidt (see the Wikipedia entry on division algorithms).


In general, to have an unbiased estimator, each sample of the random variable must enter the estimator linearly (alternatively, we can say each sample should be used only once and then discarded).
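
A quick simulation makes the bias of the naive estimator concrete (a sketch; the distribution of D and all parameters are invented for illustration):

import random

mu, sigma = 2.0, 0.5    # E[D] = 2, so the true inverse is 0.5
n_samples, n_trials = 10, 100000

total = 0.0
for _ in range(n_trials):
    d = [random.gauss(mu, sigma) for _ in range(n_samples)]
    total += n_samples / sum(d)   # naive estimator: 1 / (sample mean)

# Averages to roughly 0.503, not 0.5: the naive estimator is biased high
print(total / n_trials)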


The Newton-Raphson iteration for the inverse (1/D) is x_{i+1} = x_i (2 - D x_i). To make this unbiased, the value of D in each iteration needs to be an independent sample, and the values of x_i from previous iterations also need to be computed from independent samples. One can think of the algorithm as forming a tree, with each iteration branching to two children (one for each of the two independent copies of x_i it needs), until reaching the leaves representing the initial guess (x_0 = T_1 + T_2 D).
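
Here is a minimal sketch of that tree-structured estimator (the sampling distribution, the values of T_1 and T_2, and the recursion depth are invented for illustration):

import random

def sample_D():
    # One independent sample of the random variable D; E[D] = 2 here
    return random.gauss(2.0, 0.5)

def unbiased_inverse(depth, t1=0.1, t2=0.1):
    # Leaf of the tree: initial guess x0 = T1 + T2*D, with a fresh sample
    if depth == 0:
        return t1 + t2 * sample_D()
    # x_{i+1} = x_i * (2 - D*x_i): the two occurrences of x_i are
    # independent realizations, and D is a fresh sample, so every
    # sample enters the estimator linearly
    xa = unbiased_inverse(depth - 1, t1, t2)
    xb = unbiased_inverse(depth - 1, t1, t2)
    return xa * (2.0 - sample_D() * xb)

# Average many independent realizations; this approaches 1/E[D] = 0.5,
# but with a large variance, as noted below
estimates = [unbiased_inverse(5) for _ in range(100000)]
print(sum(estimates) / len(estimates))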


I tested this algorithm, and it appears to give unbiased results. There is a catch, though: the result has a large variance. Using the average of all the samples with the naive estimator yields a bias that is much smaller than the variance of the unbiased estimator (hence the mean squared error is a more appropriate metric for the combination of variance and bias).


Open questions/issues:


  • Is there some modification of the algorithm that will produce a smaller MSE than the naive estimator? Or even an MSE that isn't too much larger?

  • How many iterations are necessary? How does one analyze the convergence of this stochastic Newton-Raphson iteration?

  • Try out the Goldschmidt algorithm.

Sunday, April 06, 2008

Reproducible Research

There were a couple of recent posts on Greg Wilson's blog about reproducible research. The first is a pointer to a paper about WaveLab. The paper is from 1995, although it appears the philosophy hasn't changed much, judging from a more recent (2005) introduction on the website. The second post contains a reference to the Madagascar project.


These projects follow the work of Jon Claerbout on Reproducible Research.


Thoughts


  • This work is all about designing a workflow for computational research, and creating tools to support that workflow. Much of it seems quite banal, but part of the whole purpose of doing this is to free you from having to think about the boring and repetitious steps.
  • Levels of reproducibility

    1. Yourself, right away (when tweaking figures, for instance)

    2. Another research group member, or yourself 6 months later

    3. Someone in another group, (a) able to run the code and get the same results, and (b) able to understand the code

    4. Someone in another group, able to use your description to write their own code that gets the same results


    The last level is the gold standard in scientific reproducibility. Experiments are reproduced from published descriptions (ideally, although the descriptions aren't always complete). Or theorists rederive intermediate steps leading to a paper's conclusion.

    In software development, levels 1-3 are what counts. (It's not very useful software development if others can't run your code and get the expected results.) The last level is similar to multiple implementations of a standard (e.g., C++ compilers from multiple vendors).

  • The referenced works seem to focus mostly on processing of raw data (images, seismic imaging, etc) and the processing steps to produce a final result.


    How would this apply in other fields? Say, ones with smaller input data sets, but much higher computational demands.


    The Madagascar project, for example, uses SCons to coordinate the steps in producing research output. What happens when one of the steps involves several weeks of supercomputer time? Two immediate problems spring to mind: one probably doesn't want this step launched without warning by the script, and the steps for launching such a job are likely to be very site-specific, making it difficult for others to reproduce elsewhere.


  • Somewhat related entries of mine: Keeping notes and Solving Science Problems

Wednesday, March 19, 2008

Notes from PyCon 2008

I attended PyCon 2008 this past weekend, and had a good time meeting people in the Python community.


Some notes


  • I tried to give a lightning (5-minute) talk about sympy and code generation. Unfortunately, Ubuntu 7.10 on my laptop (a Dell Vostro) failed to display to the external output, so I was unable to present to the larger audience. However, I was able to give the talk to the people at the science BOF session. The slides are available here.

  • At the science BOF session I met Michael Tobis, the author of an interesting system called PyNSol. It uses Python to describe the differential equations and generates Fortran for performance.

  • The Python 2 to 3 converter looks interesting, not so much for the conversion between Python versions as for the infrastructure developed for the tool.

  • Enthought was showing off some multi-touch displays. Being able to interact with 3D visualizations (and Google Earth) via gestures for zoom, pan, and rotation made for a nice demo. The real question is whether this type of display would be useful in day-to-day work.

Monday, March 17, 2008

Code generation now uses templates

In Quameon, the derivatives of the Gaussian basis functions are computed with sympy, and the Python code is generated. Creating the code that surrounds the generated code was a bit of a tedious procedure: the syntax tree needed to be constructed by hand. But no more! The surrounding code can now be written in Python, and a parser (written using pyparsing) will create the syntax tree for you.


Some details:

Identifiers surrounded by percent signs (e.g., %name%) get turned into a template node type for subsequent replacement.


Example template:


class %orb_name%:
    def compute_value(self, alpha, v, r2):
        x = v[0]
        y = v[1]
        z = v[2]
        %value%
        return val


The code replaces %orb_name% with the orbital name (s, px, py, pz, etc.) and %value% with the expression for the value of the orbital (the methods for the derivatives are not shown; see codegen/primitive_gaussian/deriv.py in SVN for the full template).
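
For illustration, here is a string-level sketch of the %name% substitution convention. This is not the pyparsing syntax-tree machinery described above, just a simple demonstration of the idea, and the orbital expression is invented:

import re

TEMPLATE = """class %orb_name%:
    def compute_value(self, alpha, v, r2):
        x = v[0]
        y = v[1]
        z = v[2]
        %value%
        return val
"""

def fill_template(template, replacements):
    # Replace each %identifier% with its value from the dict
    return re.sub(r'%(\w+)%', lambda m: replacements[m.group(1)], template)

print(fill_template(TEMPLATE, {
    'orb_name': 'px',
    'value': 'val = x * exp(-alpha * r2)',
}))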